NSF PAR Search | NSF Public Access Repository


Do Multi-Document Summarization Models Synthesize?

https://doi.org/10.1162/tacl_a_00687

DeYoung, Jay; Martinez, Stephanie C; Marshall, Iain J; Wallace, Byron C (September 2024, Transactions of the Association for Computational Linguistics)
Louis, Annie (Ed.)
Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
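The select-or-abstain step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_or_abstain`, the `tolerance` threshold, and the toy word-list scorer `toy_sentiment` are all assumptions made for illustration, and candidate generation itself is left out.

```python
from typing import Callable, Optional, Sequence

# Hypothetical cue-word lists, used only by the toy scorer below.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "weak"}


def toy_sentiment(text: str) -> float:
    """Toy aggregate measure: share of positive cue words among all cue words.

    Returns 0.5 when the text contains no cue words at all.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)


def select_or_abstain(
    candidates: Sequence[str],
    measure: Callable[[str], float],
    target: float,
    tolerance: float,
) -> Optional[str]:
    """Pick the candidate whose measured value is closest to `target`.

    Returns None (abstains) when there are no candidates or when even the
    closest candidate is more than `tolerance` away from the target.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(measure(c) - target))
    if abs(measure(best) - target) > tolerance:
        return None
    return best


# Assumed setup: the inputs have an expected aggregate of 0.6
# (e.g., 60% positive reviews), and candidates come from a summarizer.
candidates = [
    "A great film with excellent acting.",
    "A bad film with poor pacing.",
    "Great visuals but weak plot.",
]
chosen = select_or_abstain(candidates, toy_sentiment, target=0.6, tolerance=0.2)
abstained = select_or_abstain(candidates[:2], toy_sentiment, target=0.6, tolerance=0.2)
```

Here `chosen` is the mixed summary, since its toy score is nearest the target, while `abstained` is `None` because neither remaining candidate falls within the tolerance. In practice the measure would be a learned classifier or estimator suited to the dataset, which this sketch does not attempt to reproduce.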
Full Text Available
Automatically Extracting Numerical Results from RCTs with LLMs

Yun, Hye Sun; Pogrebitskiy, David; Marshall, Iain J; Wallace, Byron C (August 2024, Machine Learning for Healthcare (MLHC))

Full Text Available
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Yun, Hye Sun; Pogrebitskiy, David; Marshall, Iain James; Wallace, Byron C (August 2024, Proceedings of the 9th Machine Learning for Healthcare Conference (MLHC), Proceedings of Machine Learning Research)
Deshpande, Kaivalya; Fiterau, Madalina; Joshi, Shalmali; Lipton, Zachary; Ranganath, Rajesh; Urteaga, Iñigo (Ed.)
Full Text Available
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

Yun, Hye; Marshall, Iain; Trikalinos, Thomas; Wallace, Byron C (December 2023, Empirical Methods in Natural Language Processing (EMNLP))

Full Text Available
Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success)

Shaib, Chantal; Li, Millicent; Joseph, Sebastian; Marshall, Iain; Li, Junyi Jessy; Wallace, Byron (July 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers))

Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized domains such as biomedicine. In this paper we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given no supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to synthesize evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data, code, and annotations used in this work.
Full Text Available
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

https://doi.org/10.18653/v1/2023.emnlp-main.626

Yun, Hye; Marshall, Iain; Trikalinos, Thomas; Wallace, Byron (January 2023, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP))

Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in LLMs offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and crosschecking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.
Full Text Available
Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges

Ramprasad, Sanjana; McInerney, Jered; Marshall, Iain; Wallace, Byron (January 2023, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations)

In this work we present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work, the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: a standard sequence-to-sequence model based on BART, and a multi-headed architecture intended to provide greater transparency and controllability to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs by allowing them to trace generated tokens back to inputs. The demonstration video can be found at https://vimeo.com/735605060. The prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/
Full Text Available
Paragraph-level Simplification of Medical Texts

https://doi.org/10.18653/v1/2021.naacl-main.395

Devaraj, Ashwin; Marshall, Iain; Wallace, Byron; Li, Junyi Jessy (January 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.
Full Text Available
Evidence Inference 2.0: More Data, Better Models

DeYoung, Jay; Lehman, Eric; Nye, Ben; Marshall, Iain J.; Wallace, Byron C. (July 2020, BioNLP: Workshop on Biomedical Natural Language Processing)

Full Text Available
Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time

Nye, Benjamin E.; Nenkova, Ani; Marshall, Iain J.; Wallace, Byron C. (January 2020, Proceedings of the Association for Computational Linguistics (ACL))

Full Text Available
